Why do we care about DRAM?

Source: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2009
Source: Power Consumption of Green Wave Architecture, 2011
Source: A Scalable Custom Simulation Machine for the Bayesian Confidence Propagation Neural Network Model of the Brain, 2014

= DRAM + Controller

DIMM Based:
General Purpose Computers
e.g. DDR3, DDR4

Package on Package (PoP):
Soldered on top of the MPSoC.
Smartphones
e.g. LPDDR3, LPDDR4

Device Based:
Embedded / Tablets / Graphics Cards
e.g. LPDDR3, GDDR5

Buffer on Board:
Memory Controller on Buffer Chip, Serial Connection
e.g. FBDIMM, IBM CDIMM, Intel SMI/SMB

3D/2.5D-Integrated:
Stacked on Logic or Silicon Interposer by means of TSVs
e.g. Wide I/O, HBM

Memory Cube:
3D-Stacked, Memory Controller on Bottom Layer, Serial Interconnect (SerDes)
e.g. HMC, SMC

[Figure labels: FPGA or MPSoC; Compute Logic; Silicon Interposer / Package Substrate]

Source: Matthias Jung

Comparison of DRAM Subsystems

Best case: 100% usage of the available BW

[Figure: Energy per transferred bit (log scale) [pJ/bit] for the DRAM subsystems]

E.g. Google Tensor ASIC (TPU): Roofline Model

[Figure: log-log roofline plot; x-axis: Operational Intensity: Ops/weight byte (log scale)]
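
The roofline idea behind the plot can be sketched in a few lines: attainable throughput is capped either by the compute peak or by memory bandwidth times operational intensity, whichever is lower. The peak and bandwidth numbers below are hypothetical placeholders chosen for illustration, not values taken from the slide.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         mem_bw_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Operational intensity (ops/byte) at which the two roofs meet."""
    return peak_ops_per_s / mem_bw_bytes_per_s


# Hypothetical accelerator (assumed numbers, for illustration only).
PEAK = 90e12  # ops/s
BW = 30e9     # bytes/s

if __name__ == "__main__":
    for oi in (10, 100, 1000, 10000):
        bound = attainable_ops_per_s(PEAK, BW, oi)
        limiter = "memory-bound" if oi < ridge_point(PEAK, BW) else "compute-bound"
        print(f"OI={oi:>6} ops/byte -> {bound:.3e} ops/s ({limiter})")
```

Below the ridge point, every extra byte per second of DRAM bandwidth translates directly into throughput, which is why the memory subsystem dominates for low-intensity workloads.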

HBM2 and HMC can provide huge bandwidths.
BUT, there is a price to pay ...

High Bandwidth Memory (HBM2):
Bandwidth: >256 GB/s / Cube
Power: 40-50 GB/s/W / Cube

Hybrid Memory Cube (HMC):
Bandwidth: >240 GB/s / Cube
Power: ~20 W / Cube

[Figure labels: PHY; GPU/CPU/SoC Die; SerDes; Buffers; Package Substrate]


Total HMC Power

[Figure: Average Power (mW)]

DRAM part only power:
for different page sizes and

HMC Power >>11W
Link power only is about 10-11W!

M. Gokhale et al., 2015

Detailed DRAM Energy Distribution

= DRAM Power Breakdown for Twitter Memcached Application*
= 2GB LPDDR3 (Low-Power DDR3 DRAM)

= Refresh optimizations
= Minimizing row misses
= Using 3D Integration
= Maximize useful data

Important DRAM Commands
= ACT: Activates a specific row in a specific bank (sensing into PSA)
= RD: Read from activated row (prefetch from PSA to SSA and burst out)
= PRE: Precharges, sets LWL=0 and LBL=VDD/2
= REFA: DRAM cells are leaky and have to be refreshed

[Figure labels: tCL + tBURST; tREFI & tREF]

(*) A High-Level DRAM Timing, Power and Area Exploration Tool, O. Naji, A. Hansson, C. Weis, M. Jung, N. Wehn
IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS), July 2015


How Refresh is Performed?

• DRAM controller sends AREF commands every tREFI (e.g. 7.8 µs for Temp. < 85°C)
• Single AREF command refreshes multiple rows in all banks (e.g. 2 rows in all 8 banks for 2Gb DDR3 DRAM)

[Figure: timeline of successive AREF commands with their refresh timing intervals, applied to Bank 0 to Bank 7]
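
The refresh arithmetic can be checked with a short sketch. It assumes a 64 ms retention window, the exact tREFI of 7.8125 µs (64 ms / 8192), and 16384 rows per bank, which corresponds to an assumed x16 organization of a 2Gb device; none of these organization details are stated on the slide. The busy-fraction example uses the 262.5 ns auto-refresh duration that the talk reports for a 4Gb x16 DDR3 device.

```python
import math


def aref_commands_per_window(retention_window_s: float, t_refi_s: float) -> int:
    """Number of AREF commands issued within one retention window."""
    return round(retention_window_s / t_refi_s)


def rows_per_bank_per_aref(rows_per_bank: int,
                           retention_window_s: float = 64e-3,
                           t_refi_s: float = 7.8125e-6) -> int:
    """Rows each bank must refresh per AREF so that all rows are covered."""
    n_aref = aref_commands_per_window(retention_window_s, t_refi_s)
    return math.ceil(rows_per_bank / n_aref)


def refresh_busy_fraction(t_rfc_s: float, t_refi_s: float) -> float:
    """Fraction of time the device is blocked by refresh."""
    return t_rfc_s / t_refi_s


if __name__ == "__main__":
    # Assumed organization: 8 banks x 16384 rows (2Gb, x16).
    print(rows_per_bank_per_aref(16384))
    # 262.5 ns refresh duration every 7.8125 us.
    print(f"{refresh_busy_fraction(262.5e-9, 7.8125e-6):.2%}")
```

Because the busy fraction scales with the refresh duration, and that duration grows with device capacity, the refresh penalty rises for denser devices.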

Refresh/Temperature Challenge

Exponential temperature/leakage current behavior
> shorter refresh periods

[Figure (LPDDR3 datasheet): leakage current [µA] and refresh period [ms] versus temperature [°C]]

Refresh/Temperature Challenge

[Figures: Refresh Performance Impact and Refresh Energy Overhead versus device capacity (2 Gb, 4 Gb, 8 Gb, 16 Gb, 32 Gb, 64 Gb)]

J. Liu, et al. RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012
I. Bhati, et al. DRAM Refresh Mechanisms, Trade-offs and Penalties, IEEE Trans. 2015

4 TB DDR3 DRAM
Stand-by 300W (only Refresh)

Paul Rosenfeld (IBM Server on display at Supercomputing)

Refresh Optimization Techniques

To counterbalance this trend for future devices and the higher temperatures in 3D-DRAMs ...

= Temperature-aware Bank-wise Refresh (detailed control)
= Approximate DRAM
= ORGR - Optimized Row Granular Refresh (only refresh data that is stored in an optimized way)

= Different refresh rates on different dies (bank groups), according to the temperature of the die/bank
= Each bank was equipped with a TS (Temp Sensor)

[Figure: 3D-DRAM cube; refresh timeline from 0 µs to 15 µs; Normal versus Bank-wise refresh for AndEBench, 0xBench and SmartBench]

M. Sadri, et al. Energy Optimization in 3D MPSoCs with Wide-I/O DRAM Using Temperature Variation Aware Bank-wise Refresh, DATE 2014

Lowering or completely switching off refresh, accepting risk of data errors
= Consider DRAM device as a stochastic model that includes process variations

[Figure: Data Lifetime and Application Robustness decide whether to Switch Off Refresh]

Switch off refresh:
= If data lifetime is smaller than required refresh period
= If data lifetime is larger than required refresh period AND application has some robustness w.r.t. errors
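
The two conditions above amount to a small decision rule. The sketch below is only an illustration: the `Region` type, its field names and the example numbers are hypothetical and not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical description of a memory region and the data stored in it."""
    data_lifetime_s: float            # how long the data must stay valid
    required_refresh_period_s: float  # measured period before first retention errors
    error_tolerant: bool              # application is robust w.r.t. bit errors


def can_switch_off_refresh(region: Region) -> bool:
    """Apply the two conditions from the slide."""
    if region.data_lifetime_s < region.required_refresh_period_s:
        # Data is dead before the first cell would lose its content.
        return True
    # Data outlives the retention time: only acceptable if errors are tolerated.
    return region.error_tolerant


if __name__ == "__main__":
    frame_buffer = Region(data_lifetime_s=0.0167, required_refresh_period_s=1.0,
                          error_tolerant=False)
    database = Region(data_lifetime_s=3600.0, required_refresh_period_s=1.0,
                      error_tolerant=False)
    print(can_switch_off_refresh(frame_buffer), can_switch_off_refresh(database))
```

The required refresh period used here must come from measurements and a statistical retention model, not from the conservative datasheet value.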


[Figure: Cumulative Failure Probability (log) versus Retention Time (log). The conservative datasheet refresh period (i.e. 64 ms) contains a guardband; the required refresh period based on measurements is longer (first errors happen after e.g. 1 s).]

Statistical retention error model and measurements mandatory

[Figure: measurement setup with simultaneous sampling ADC and sense amplifier; analog and digital signals]

High freq. 1.2 GHz and higher: DDR4-2400
Precise temperatures for heating up to 95°C
Exact current measurements incl. VPP

DDR4 Retention Time Measurements

= Retention behavior depends on cell leakage (drain, sub-threshold, cell capacitor), cross talk, process variations, temperature, cell type

[Figure: Cumulative Failure Probability (log) for FF and random data patterns at 30°C, 60°C and 90°C]

Random pattern worse than FF

= Asymmetric error behavior dependent on cell type (true-cell, anti-cell)

[Figure: Cumulative Failure Probability (log, 1E-12 to 1E+00) versus Refresh Interval [s] (0.1 to 100) for 1-to-0 and 0-to-1 flips at 30°C and 60°C; random pattern, true-cell DRAM]

1-to-0 flip much more likely than 0-to-1 flip
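
Information Theory

Consider memory as noisy communication channel

Transmitter > x > (noisy) Channel > y > Receiver

Symmetric retention behavior: Binary Symmetric Channel (BSC)
0 > 0 and 1 > 1 with probability 1 - p; 0 > 1 and 1 > 0 with probability p

$$C_{BSC} = 1 - H(p)$$

Asymmetric retention behavior: Z-Channel
0 > 0 with probability 1; 1 > 0 with probability p; 1 > 1 with probability 1 - p

$$C_{Z} = \log_2\left(1 + (1-p)\,p^{\,p/(1-p)}\right) \approx 1 - \tfrac{1}{2} H(p)$$

where $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function.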
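
To see how much capacity an asymmetric error model recovers, the standard closed forms for the two channels can be evaluated numerically: 1 - H(p) for the BSC and log2(1 + (1 - p) p^(p/(1-p))) for the Z-channel. This is a minimal sketch; the function names are chosen here for illustration, and p denotes the flip probability of the error-prone transition.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(p: float) -> float:
    """Capacity of the binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)


def z_channel_capacity(p: float) -> float:
    """Capacity of the Z-channel where only one symbol flips, with probability p."""
    if p == 0.0:
        return 1.0
    if p == 1.0:
        return 0.0
    return math.log2(1.0 + (1.0 - p) * p ** (p / (1.0 - p)))


if __name__ == "__main__":
    for p in (1e-3, 1e-2, 1e-1):
        print(f"p={p:g}: BSC={bsc_capacity(p):.4f}  Z={z_channel_capacity(p):.4f}  "
              f"1-H/2={1.0 - 0.5 * binary_entropy(p):.4f}")
```

For every flip probability strictly between 0 and 1 the Z-channel value exceeds the BSC value, which is the quantitative reason why knowing the cell type pays off.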


= Larger reliability if internal cell structure is known
= More efficient ECC techniques possible
= Appropriate data representation: e.g. for a small dynamic range, two's complement (C2) versus sign/magnitude


Approximate DRAM Simulation Framework

[Figure: framework overview. DRAMSys (cores and DRAM: system behavior & retention error model, power analysis, thermal analysis) and DRAMMeasure (parameter measurements & model calibration) are combined to explore refresh vs. errors vs. power vs. performance and the impact on the application.]

Partners: CEA-Leti, École Polytechnique Fédérale de Lausanne, ARM, Synopsys

A Per Layer Refresh Policy for 3D DRAMs

Separation of a 3D DRAM Stack into unreliable and reliable regions

= Reliable regions: higher DRAM layers with temperature-aware refresh
= Unreliable region: bottom DRAM layer with disabled refresh > Omit Refresh (OR)
= Access unreliable region while reliable region is refreshed

Typical example applications
= Graph processing
= Image processing
= Baseband processing

> Saves 100% refresh power in the unreliable layer
> Increases bandwidth

[Figure: MPSoC with channel controllers, dual core, GPU and accelerators; simulation results]

Matthias Jung, et al. 2015. Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs.

= Selective Refresh
= Retention Aware Refresh
= Approximate DRAM

> Use Optimized Row Granular Refresh

Drawbacks of normal Auto-Refresh:
= AREF lacks flexibility
= No access to internal refresh row counter
= No rows can be skipped
= The complete DRAM has to be refreshed at the same rate


ORGR - Idea / Vendor Specific

= Internal min. timer present at all vendors (safety)!
= But, vendor-specific implementation
= Reverse engineering technique performed during init, boot or run-time

[Figure: DQS waveform with timing annotations 37.5 ns and 20.7 ns]

Patent Pending

ORGR - Benefits

[Figure: per-bank command timelines (Bank 0 to Bank 7) for RGR (based on JEDEC timings) and for ORGR; annotated durations 158 ns (RGR) versus 86.25 ns (ORGR). Patent pending.]

Refresh Technique    tRFC [ns]    Refresh Energy
Auto Refresh         262.5        186.24
RGR                  292.5        230.48
ORGR                 146.25       209.72

• Measured for 4Gb x16 DDR3 DRAM
• Refreshing the complete DRAM

ORGR - Validation/Reliability

[Figure: Cumulative Failure Probability (log) versus Retention Time [s] (log, 1 to 100) at 30°C for RGR, ORGR and Auto-Refresh, and at 90°C for ORGR and Auto-Refresh]

"Non-deterministic" DRAM Timing Behavior

= DRAM latency varies largely with:
  - Application
  - Address Mapping
  - DRAM
  - Memory Controller
= Similar variation in energy/DRAM access

[Figure: latency histograms for CHStone ADPCM and CHStone GSM on DDR3 with BRC and RBC address mappings and FCFS scheduling]

Application Aware Address Mapping

Standard Mapping (BRC): Input Address = Bank | Row | Column
Bank Interleaving (RBC): Input Address = Row | Bank | Column
Permutation-Based Page Interleaving: Input Address = Row | Bank* | Column
Toggling Rate Analysis (per address bit)
Universal Address Mapping: Input Address > Any Bijective Mapping (driven by Application Knowledge) > Output Address

ConGen Methodology

Exploit full application knowledge, i.e. determinism of access pattern
= Minimize #row misses in same bank
= Decrease energy and latency, and increase bandwidth

[Figure: ConGen flow. Application Specific Memory Access Pattern > ConGen (Math. Optimization, Bandwidth & Energy Estimation) > Application Specific Memory Controller (ASMC) in SystemVerilog]

Optimization Problem

= Minimize number of row misses for all DRAM banks over a given logical memory access trace
= NP-hard problem
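
The cost function of this optimization, the number of row misses of a trace under a given address mapping, is cheap to evaluate even though finding the best mapping is hard. The sketch below compares the BRC and RBC mappings on a toy trace. The geometry (3 bank bits, 14 row bits, 10 column bits), the trace, and the convention that the first access to an idle bank is not counted as a miss are assumptions made for illustration; this is not the ConGen implementation.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical DRAM geometry, chosen for illustration only.
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10

Mapping = Callable[[int], Tuple[int, int, int]]


def map_brc(addr: int) -> Tuple[int, int, int]:
    """Bank-Row-Column: bank bits are the most significant bits."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col


def map_rbc(addr: int) -> Tuple[int, int, int]:
    """Row-Bank-Column: consecutive pages land in different banks."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return bank, row, col


def count_row_misses(trace: Iterable[int], mapping: Mapping) -> int:
    """Count accesses that hit a bank whose open row differs from the target row."""
    open_row: Dict[int, int] = {}
    misses = 0
    for addr in trace:
        bank, row, _ = mapping(addr)
        if bank in open_row and open_row[bank] != row:
            misses += 1
        open_row[bank] = row
    return misses


def ping_pong_trace(n: int) -> list:
    """Alternate between two streams that are exactly one page apart."""
    page = 1 << COL_BITS
    return [base + i for i in range(n) for base in (0, page)]


if __name__ == "__main__":
    trace = ping_pong_trace(8)
    print("BRC:", count_row_misses(trace, map_brc))
    print("RBC:", count_row_misses(trace, map_rbc))
```

An optimizer in the spirit of the slide would search over bijective bit mappings and use such a counter as its objective.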

ConGen Methodology - Results

Industrial image processing task (Image Rotation 1024 x 576 Pixel)

[Figures: two bar charts comparing BRC, RBC, ConGen1 and ConGen2 for pixel sizes of 32 and 64 bit; one y-axis is Energy [J], x-axes are Pixel Size [bit]]


Legend: BRC = Bank-Row-Column, RBC = Row-Bank-Column address mapping 

Architecture level design space exploration tool to estimate the timings, area, and power for high density ReRAM crossbar devices (ReRAMSpec)

System level (SystemC) and behavioral (SystemVerilog) modelling of ReRAM devices

Circuit level modelling with SPICE of the ReRAM array cross-section and periphery (drivers, sense-amps etc.)

Heterogeneous Memory using ReRAM + DRAM

= Each channel consists of DRAM (smaller capacity) and ReRAM (larger capacity)
= Special Memory Controllers (MCs) needed / Hybrid

Two Options:
= DRAM as a Cache for ReRAM
= DRAM and ReRAM individually addressable

[Figure: 3D-stacked ReRAM and 3D-stacked DRAM on top of an MPSoC with memory controllers]


Definition of a new protocol for NVM (e.g. ReRAM) necessary!

[Figure: input of the controller (REQ / RESP) and per-bank command timeline with DRAM commands (ACT, RD) and NVM commands (NVRD, NVWR, NVRD (Retry), NVCEK)]

[Figure: floorplans of the ReRAM die and the DRAM die for the 3D architecture, with banks, voltages/IOs/periphery, ReRAM TSVs and ReRAM/DRAM power TSVs. Annotations: 32 bit data per access; 14 bit data per access; x72 (a wide 72 IO data interface, supporting cache tag or ECC storage); x4 (4 bit data bus over the TSVs)]

ReRAM Parameters for a Heterogeneous 3D Memory System
Capacity of the Die                         16 Gbit
Die Size                                    11.7 x 12.6 mm
Number of Channels                          4
Capacity/Channel and Layer (tier) size      4 Gbit
ReRAM/Logic Technology                      28 nm
Number of Banks per Channel                 4
IO width(s)                                 4 data lines
Interface and Frequency                     DDR, Prefetch 8, 500 MHz
Maximum Bandwidth                           1 Gb/s per Pin

DRAM Parameters for a Heterogeneous 3D Memory System
Capacity of the Die                         8 Gbit + 1 Gbit (for Tag or ECC)
Die Size                                    9 x 11.5 mm
Number of Channels                          4
Capacity/Channel * Layer size               2 Gbit + 256 Mbit (for Tag or ECC)
DRAM Technology                             22 nm
Number of Banks per Channel                 4
Page size(s)                                2K Bytes
IO width(s)                                 72 data lines
Interface and Frequency                     DDR, Prefetch 4, 500 MHz
Maximum Bandwidth                           1 Gb/s per Pin

[Figure: simulation setup. Pre-recorded trace files > arbitration & mapping > hybrid channel controller (frontend and backend) > DRAM + ReRAM channels, connected through a special DRAM / ReRAM-AT TLM protocol and a special handshake protocol]

[Figure: trace player view of system simulation using our SystemC TLM 2.0 model. Rows: NVM banks 0 to 3, NVM command bus, NVM data bus, DRAM banks 0 to 3, DRAM command bus, DRAM data bus, command bus, data bus, ECC; time axis from 2480 ns to 2560 ns]

Option 1
DRAM and ReRAM separately addressable

Option 2
ReRAM is addressable
DRAM is not directly addressable (Cache)
Special handshake protocol (e.g. NVDIMM-P), to avoid unnecessary waits in case of cache misses

Special ReRAM protocol

= A custom multi-chip design to simulate the human brain in real time using the spiking BCPNN (Bayesian Confidence Propagation Neural Network)
= The architecture for this algorithm is based on Hyper Column Units (HCUs) and Mini Column Units (MCUs)
= The parallel computability of HCUs and MCUs makes this architecture hardware friendly
= Each HCU is an aggregation of 100 MCUs
= The hyper column unit has 10000 input connections and 100 output connections

[Figure: HCU updates (row-wise, col-wise), integrate decay calculation, BCPNN update, output spike computation, 100 MCUs]

HMCs consume ~40 kW

eBrain: Multi-chip fabric of BCUs connected by inter-BCU spike propagation network

L: Number of BCU Chips
M: Number of H-Tiles in each BCU
N: Number of HCUs in each H-Tile

[Figure: BCU chips with H-Tiles (15 MB), Inter-BCU Spike Propagation Network (SPN), and a System Controller to boot, initialize and save/restore the HCU]

Custom 3D-DRAM for eBRAIN II

= Custom-optimized 3D-DRAM architecture => 48 I/O DDR microChannel per HCU (1 - 2 mm² depending on the DRAM tech.) with 500 MHz freq.
= Tailored access > using a technique called "Row merge", where we balanced the BW between Row-updates and Col-updates.

Matrix to bank mapping of 4 HCUs (Bank 0 to Bank 7 on the DRAM dies):
> optimized data layout

           # of HCUs     Power
Mouse      1.6 x 10^3    13 W
Rat        5.0 x 10^3    44 W
Cat        6.0 x 10^4    522 W
Macaque    2.0 x 10^5    1700 W
Human      2.0 x 10^6    7 kW


= New memory technologies:
  - PCM
  - 3DXPoint
  - STT-MRAM
  - ReRAM
= DRAM won't be dead, but will change its role > maybe used as Cache...
= New memory ECC techniques
= Heterogeneous main memory systems:
  - NVDIMM-P
  - 3D MPSoCs / 3D Memory Stacks
= New requirements on:
  - Compiler
  - OS
= Processing in memory (PIM)

[Figure: Device Capacity (bits) of memory technologies]

[Figure: heterogeneous memory stack. Non-volatile memory, e.g. Flash: low endurance & reliability. 3D-stacked ReRAM: medium endurance & reliability, medium latency and density. 3D-stacked DRAM: high endurance & reliability, low latency and density, used as cache. MPSoC with cache and unified memory controller (MC).]

In/Near-Memory Processing

[Figure: DRAM bank architecture (Bank 0 to Bank 7) with sub-array (0) and sub-array (1), data input from deserializer logic and data output to serializer logic]

= For NN processing (e.g. Mult & Add)
= Place special logic between the sub-arrays
= Maximize degree of parallel data processing, e.g. 512/1024 bit in DDR3/4 devices


In/Near-Memory Processing

[Figure: layout, 140 µm wide; b) data layout]

IPx: Input Layer
Wxy: Weight
Hy: Hidden/Output Layer
COM: Config Flag Layer Type
WLx: Wordline (Mem. Row)

The data layout:

= Weights corresponding to separate neurons (H0, H1, ...) are stored row-wise.
= Process the inputs of a single neuron in parallel by reading the weights in parallel from the corresponding DRAM row
= Currently 100 values (100 bits because of 1-bit weights) are accessed per row per clock cycle
= Maximum bandwidth per clock cycle can be 512/1024 bits for DDR3/DDR4 minus the control bits.
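
With 1-bit weights stored row-wise, a neuron's weighted sum reduces to bitwise AND operations followed by additions. The sketch below models one such row operation in software. It assumes weights in {0, 1} and unsigned 2-bit inputs split into bit-planes; the actual hardware encoding is not given on the slides, so the function names and the bit-plane scheme are illustrative assumptions.

```python
from typing import List, Sequence


def pack_bits(bits: Sequence[int]) -> int:
    """Pack a list of 0/1 values into an integer, element i at bit position i."""
    word = 0
    for i, bit in enumerate(bits):
        if bit:
            word |= 1 << i
    return word


def bit_planes(inputs: Sequence[int], precision_bits: int) -> List[int]:
    """Split unsigned inputs into one packed word per bit position."""
    return [pack_bits([(x >> b) & 1 for x in inputs]) for b in range(precision_bits)]


def neuron_sum(weight_row: int, planes: Sequence[int]) -> int:
    """Weighted sum of one neuron: AND per bit-plane, count ones, scale by 2^b."""
    total = 0
    for b, plane in enumerate(planes):
        total += bin(weight_row & plane).count("1") << b
    return total


if __name__ == "__main__":
    inputs = [3, 1, 2]   # 2-bit input values
    weights = [1, 0, 1]  # 1-bit weights of one neuron (one DRAM row)
    print(neuron_sum(pack_bits(weights), bit_planes(inputs, 2)))
```

One packed weight word stands for the 100 weight bits read from a row; the AND and the two summation steps loosely mirror the AND operation and adder stages named in the layout.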

Neural Network Layout Implementation

[Figure: layout stages: ISO+SH1 | AND Operation | First Adder Stage | Second Adder Stage | Akt | Slc | Driver Stage + SH2]

Current implementation:

Input data precision, QI [bits]                          2
Weights precision, QW [bits]                             1
Number of neurons per fully-connected layer, N           100
Number of layers, L                                      10
Access time per DRAM row, T [ns]                         20
Number of sub-arrays per bank, S                         16
Number of banks in DRAM device, B                        8

Taking into account that the current implementation allows accessing N weights per sub-block in S sub-arrays in B banks of a single device in parallel, the computed throughput is:

N * 2 (*) * S * B / T = 1.28 TOP/s

(*) The factor 2 comes from addition and multiplication considered as separate operations
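
The throughput figure follows directly from the table values, as this short check shows. The function and parameter names are chosen here for illustration only.

```python
def pim_throughput_ops_per_s(neurons: int,
                             sub_arrays: int,
                             banks: int,
                             row_access_time_s: float,
                             ops_per_weight: int = 2) -> float:
    """N * 2 * S * B / T, counting multiply and add as separate operations."""
    return neurons * ops_per_weight * sub_arrays * banks / row_access_time_s


if __name__ == "__main__":
    # N = 100 neurons, S = 16 sub-arrays, B = 8 banks, T = 20 ns per row access.
    tops = pim_throughput_ops_per_s(100, 16, 8, 20e-9) / 1e12
    print(f"{tops:.2f} TOP/s")
```

Since the expression is linear in S and B, doubling either the sub-arrays per bank or the number of banks used in parallel doubles the estimate, while a slower row access time reduces it proportionally.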

Summary: Take-away messages

Approximate DRAM and optimized refresh control can be used to trade off BW vs. reliability

ConGen methodology to improve BW and energy

Custom 3D-DRAMs have a large potential

Hybrid/heterogeneous architectures and near/in-memory processing will be key

Thank you for listening
For more information: //ems.eit.uni-kl.de
For the tools: https://www.uni-kl.de/3d-dram/tools/


